flowchart TD
A("Data Cleaning / Wrangling")
B("Imputation")
C("Feature Selection")
subgraph unsupervised ["Unsupervised ML"]
subgraph clustering ["Clustering"]
D("KMeans")
E("AgglomerativeClustering")
end
subgraph dimred ["Dimensionality Reduction"]
L("Principal Component Analysis")
end
end
subgraph supervised ["Supervised Regression"]
F("Linear Regression")
G(AdaBoostRegressor)
H(XGBRegressor)
end
J["Analysis"]
K["Selection"]
A --> B --> C
C --> unsupervised --> supervised --> J --> K
Unsupervised Real Estate Price Prediction - Final Report - Deliverable 1
DTSA 5510 Unsupervised Algorithms in Machine Learning - University of Colorado Boulder
1 Introduction
1.1 Project Topic
This project utilizes unsupervised and supervised machine learning (ML) algorithms to perform price prediction on real-world real estate listings downloaded from Zillow.com, with additional listing information scraped from Redfin.com. To start, Section 3 describes the source of the data and introduces the dataset. Section 4 uses common techniques to clean the dataset, removing outliers and sanitizing columns to build a dataset that can be input into varied ML algorithms. Section 5 explores the dataset and produces visualizations for the author and audience to gain an initial understanding of the data. This section also explores the interactions between features and begins to analyze collinearity.
In Section 6 we dive deep into both unsupervised and supervised ML algorithms and build models to cluster, then regress, on the data, aiming to create subsets of data that model price more accurately than one large dataset. Section 7 discusses the results of the modeling and aims to select a best path for price prediction. Finally, Section 8 details the outcomes of this project and discusses areas of further research. All files and code for this project can be found here: https://github.com/simmsa/dtsa_5510_final.
To narrow down the listings, simplify data collection, and make the result more relevant for the author, this document focuses on real estate listings in the Denver, Colorado metropolitan area. As this project utilizes a relatively small amount of real-world data, the outcomes may not be definitive. Throughout this project we aim to highlight the shortcomings of this analysis and discuss methods for improving data size and data quality. As we build our models we aim to find a good balance of accuracy and execution speed, being careful not to create a model that works for this simple use case but is too complex for a larger problem.
1.2 Project Goal
The goal of this project is to build a pricing model that predicts prices for new real estate listings in order to seek out listings that may be undervalued. To achieve this goal we plan to use an ML divide-and-conquer strategy, which uses clustering and principal component techniques to reduce the data into smaller groups. These smaller groups will be passed into several supervised regression algorithms, and their accuracies will be combined to determine the best performing combination of models.
For this dataset, our target variable is price, a floating-point value. Regression algorithms are a suitable choice for modeling this variable. It is worth noting that typical regression algorithms expect floating-point features, so we plan to convert as much of the data as possible to floating point values. As this conversion is not the primary focus of this project, certain features may be omitted to streamline the construction and analysis of the models. Even so, we strive to keep the core values of the data, and our goal for cleaning and processing the data is to build a unified source of truth that will be utilized by all ML models.
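The divide-and-conquer strategy can be sketched in code. The following is a minimal illustration on synthetic data only, not the models built later in this report: we split the feature space into clusters with KMeans, fit an independent linear regressor within each cluster, and score the combined predictions.

```python
# Minimal divide-and-conquer sketch on synthetic data (illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))  # stand-in for listing features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=300)

# Step 1: split the data into smaller groups
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Step 2: regress within each group and combine the predictions
y_pred = np.empty_like(y)
for k in np.unique(labels):
    mask = labels == k
    model = LinearRegression().fit(X[mask], y[mask])
    y_pred[mask] = model.predict(X[mask])

print(f"Combined R^2: {r2_score(y, y_pred):.3f}")
```

In the real pipeline, the combined per-group accuracy would be compared against a single model fit on the whole dataset to decide whether splitting helps.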
1.3 Project Plan
In Figure 1 we outline the steps necessary to reach our goal of predicting the price of a real estate listing.
2 Python Setup
The following code section includes the Python libraries used to execute the code contained in this document. See Section 9 for complete details on the environment and library versions.
import itertools
import copy
import time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostRegressor
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
import xgboost as xgb
sns.set_theme()
2.1 Visualization Functions
def brand_plot(has_subplots=False):
"""
Add branding elements to matplotlib.plt objects with a title and copyright
Parameters:
- has_subplots (bool): Whether the plot has subplots. Default is False.
Returns:
None
"""
if not has_subplots:
plt.suptitle("Real Estate Listing Price Prediction")
txt = "DTSA 5510 - 2023 Summer 1 - Andrew Simms"
plt.figtext(
0.95,
-0.01,
txt,
wrap=True,
horizontalalignment="right",
fontsize=6,
fontstyle="italic",
)
plt.tight_layout()
3 Data Information
3.1 Data Source
In the United States (US), real estate listing data is typically not readily available to the consumer. The Multiple Listing Service (MLS) contains these listings, which it makes available to realtors. To provide a product, companies like Zillow and Redfin acquire these listings and offer a service that makes it easy for users to find real estate properties for sale. Ideally we would source listing data directly from the MLS, but it does not offer a readily available API. We can, however, use data from Zillow and Redfin; we just have to find methods for scraping each respective site.
Data is acquired from Zillow using our custom-made scraper.py Python script. This script interfaces with the Zillow GetSearchPageState API and downloads current real estate listings for the included zip codes. A framework for this type of scraping can be found in this Zillow web scraping tutorial. The resulting JSON is then formatted with formatter.py and saved to a csv file.
To add additional features, we use the addresses from Zillow to query additional data from Redfin using the Python-Redfin library. Querying Redfin adds information about school districts, home features, neighborhood, and listing condition that should provide additional features to improve the pricing model. This data is combined with the Zillow data and saved to a csv file.
3.1.1 Data Acquisition Flowchart
flowchart LR
subgraph daq ["Data Acquisition"]
direction LR
subgraph zillow ["Zillow.com"]
A["scraper.py"]
B["GetSearchPageState API"]
C["Zillow Listings"]
D["json"]
E["formatter.py"]
F["csv"]
end
subgraph redfin ["Redfin.com"]
G["redfin_scraper.py"]
H["csv"]
end
K["Final Output"]
A --> B --> C --> D --> E --> F
F -- Addresses --> G --> H --> K
end
The following code reads the saved csv dataset into a pandas DataFrame object. In Section 3.2 columns 0 to 87 originate from Zillow and columns 88 and above are from Redfin.
df = pd.read_csv("2023_04_22-17_57_52_10_mi_radius_unique_denver_area_w_redfin.csv")
3.2 Data Description
The initial data is shown using df.info() as executed below:
df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 718 entries, 0 to 717
Data columns (total 161 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 718 non-null int64
1 zpid 718 non-null int64
2 price 718 non-null object
3 priceLabel 718 non-null object
4 beds 714 non-null float64
5 baths 718 non-null float64
6 area 715 non-null float64
7 statusType 718 non-null object
8 statusText 718 non-null object
9 isFavorite 718 non-null bool
10 isUserClaimingOwner 718 non-null bool
11 isUserConfirmedClaim 718 non-null bool
12 imgSrc 718 non-null object
13 hasImage 718 non-null bool
14 visited 718 non-null bool
15 listingType 26 non-null object
16 shouldShowZestimateAsPrice 718 non-null bool
17 detailUrl 718 non-null object
18 pgapt 718 non-null object
19 sgapt 718 non-null object
20 has3DModel 718 non-null bool
21 hasVideo 718 non-null bool
22 isHomeRec 718 non-null bool
23 address 718 non-null object
24 info3String 708 non-null object
25 info1String 708 non-null object
26 brokerName 708 non-null object
27 hasAdditionalAttributions 718 non-null bool
28 isFeaturedListing 718 non-null bool
29 isShowcaseListing 718 non-null bool
30 availabilityDate 0 non-null float64
31 timeOnZillow 718 non-null int64
32 latLong_latitude 718 non-null float64
33 latLong_longitude 718 non-null float64
34 variableData_type 423 non-null object
35 variableData_text 423 non-null object
36 variableData_data_isFresh 18 non-null object
37 hdpData_homeInfo_zpid 718 non-null int64
38 hdpData_homeInfo_zipcode 718 non-null int64
39 hdpData_homeInfo_city 718 non-null object
40 hdpData_homeInfo_state 718 non-null object
41 hdpData_homeInfo_latitude 718 non-null float64
42 hdpData_homeInfo_longitude 718 non-null float64
43 hdpData_homeInfo_price 718 non-null float64
44 hdpData_homeInfo_datePriceChanged 244 non-null float64
45 hdpData_homeInfo_bathrooms 718 non-null float64
46 hdpData_homeInfo_bedrooms 714 non-null float64
47 hdpData_homeInfo_livingArea 715 non-null float64
48 hdpData_homeInfo_homeType 718 non-null object
49 hdpData_homeInfo_homeStatus 718 non-null object
50 hdpData_homeInfo_daysOnZillow 718 non-null int64
51 hdpData_homeInfo_isFeatured 718 non-null bool
52 hdpData_homeInfo_shouldHighlight 718 non-null bool
53 hdpData_homeInfo_zestimate 608 non-null float64
54 hdpData_homeInfo_rentZestimate 642 non-null float64
55 hdpData_homeInfo_listing_sub_type_is_FSBA 664 non-null object
56 hdpData_homeInfo_priceReduction 213 non-null object
57 hdpData_homeInfo_isUnmappable 718 non-null bool
58 hdpData_homeInfo_isPreforeclosureAuction 718 non-null bool
59 hdpData_homeInfo_homeStatusForHDP 718 non-null object
60 hdpData_homeInfo_priceForHDP 718 non-null float64
61 hdpData_homeInfo_priceChange 244 non-null float64
62 hdpData_homeInfo_isNonOwnerOccupied 718 non-null bool
63 hdpData_homeInfo_isPremierBuilder 718 non-null bool
64 hdpData_homeInfo_isZillowOwned 718 non-null bool
65 hdpData_homeInfo_currency 718 non-null object
66 hdpData_homeInfo_country 718 non-null object
67 hdpData_homeInfo_taxAssessedValue 703 non-null float64
68 hdpData_homeInfo_lotAreaValue 714 non-null float64
69 hdpData_homeInfo_lotAreaUnit 714 non-null object
70 hdpData_homeInfo_listing_sub_type_is_openHouse 106 non-null object
71 hdpData_homeInfo_listing_sub_type_is_comingSoon 27 non-null object
72 hdpData_homeInfo_openHouse 106 non-null object
73 hdpData_homeInfo_open_house_info_open_house_showing 106 non-null object
74 variableData 0 non-null float64
75 hdpData_homeInfo_listing_sub_type_is_newHome 26 non-null object
76 hdpData_homeInfo_newConstructionType 26 non-null object
77 hdpData_homeInfo_videoCount 23 non-null float64
78 communityName 21 non-null object
79 isPropertyResultCDP 3 non-null object
80 style 1 non-null object
81 isCdpResult 3 non-null object
82 unitCount 3 non-null float64
83 hdpData_homeInfo_group_type 3 non-null object
84 hdpData_homeInfo_priceSuffix 3 non-null object
85 hdpData_homeInfo_providerListingID 21 non-null float64
86 hdpData_homeInfo_unit 7 non-null object
87 hdpData_homeInfo_listing_sub_type_is_bankOwned 1 non-null object
88 schools_rating 681 non-null float64
89 school_district 681 non-null object
90 num_stories 510 non-null float64
91 year_built 639 non-null float64
92 year_renovated 557 non-null float64
93 sq_ft_finished 637 non-null float64
94 total_sq_ft 639 non-null float64
95 lot_sq_ft 673 non-null float64
96 taxable_land_value 671 non-null float64
97 taxable_improvement_value 641 non-null float64
98 county_name 681 non-null object
99 basement_information_BASEMENT_SQUARE_FEET 491 non-null object
100 basement_information_BASEMENT_FINISH_CODE 307 non-null object
101 basement_information_BASEMENT_TYPE_CODE 353 non-null object
102 bathroom_information_FULL_BATHS 559 non-null float64
103 bathroom_information_HALF_BATHS 210 non-null float64
104 heating_&_cooling_HEATING_TYPE_CODE 625 non-null object
105 room_information_TOTAL_ROOMS 218 non-null float64
106 exterior_information_CONDITION_CODE 226 non-null object
107 exterior_information_STYLE_CODE 333 non-null object
108 exterior_information_CONSTRUCTION_TYPE_CODE 501 non-null object
109 exterior_information_BUILDING_QUALITY_CODE 631 non-null object
110 exterior_information_EXTERIOR_WALL_CODE 220 non-null object
111 exterior_information_ROOF_COVER_CODE 87 non-null object
112 exterior_information_ROOF_TYPE_CODE 121 non-null object
113 exterior_information_BUILDING_CODE 545 non-null object
114 property_information_SUBDIVISION_NAME 638 non-null object
115 property_information_LIVING_SQUARE_FEET 635 non-null object
116 property_information_GROUND_FLOOR_SQUARE_FEET 623 non-null object
117 property_information_BUILDING_SQUARE_FEET 637 non-null object
118 property_information_STORIES_NUMBER 510 non-null float64
119 property_information_STORIES_CODE 496 non-null object
120 property_information_LEGAL_DESCRIPTION 671 non-null object
121 parking_&_garage_information_PARKING_TYPE 599 non-null object
122 parking_&_garage_information_GARAGE_PARKING_SQUARE_FEET 574 non-null object
123 parking_&_garage_information_GARAGE_CODE 578 non-null object
124 lot_information_NUMBER_OF_BUILDINGS 638 non-null float64
125 lot_information_LAND_SQUARE_FOOTAGE 671 non-null object
126 lot_information_LEGAL_LOT_NUMBER 629 non-null object
127 lot_information_LEGAL_BLOCK_NUMBER 436 non-null object
128 lot_information_MUNICIPALITY_NAME 197 non-null object
129 lot_information_ACRES 671 non-null float64
130 lot_information_COUNTY_USE_DESCRIPTION 231 non-null object
131 assessor_information_ASSESSED_YEAR 671 non-null float64
132 assessor_information_TAX_AREA 179 non-null object
133 heating_&_cooling_AIR_CONDITIONING_CODE 219 non-null object
134 fireplace_information_FIREPLACE_INDICATOR 378 non-null object
135 property_information_ADJUSTED_GROSS_SQUARE_FEET 477 non-null object
136 parking_&_garage_information_PARKING_SPACES 6 non-null float64
137 lot_information_ZONING_CODE 282 non-null object
138 lot_information_STATE_USE_DESCRIPTION 445 non-null object
139 bathroom_information_THREE_QUARTERS_BATHS 374 non-null float64
140 fireplace_information_NUMBER_OF_FIREPLACES 360 non-null float64
141 location_information_LOCATION_INFLUENCE_CODE 38 non-null object
142 fireplace_information_FIREPLACE_TYPE_CODE 18 non-null object
143 exterior_information_BUILDING_IMPROVEMENT_CODE 184 non-null object
144 pool_information_POOL_CODE 43 non-null object
145 pool_information_POOL_INDICATOR 47 non-null object
146 lot_information_DEPTH_FOOTAGE 25 non-null float64
147 lot_information_FRONT_FOOTAGE 25 non-null float64
148 bathroom_information_BATH_FIXTURES 86 non-null float64
149 heating_&_cooling_FUEL_CODE 87 non-null object
150 property_information_GROSS_SQUARE_FEET 84 non-null object
151 utility_information_UTILITIES_CODE 87 non-null object
152 property_information_SA_LGL_DSCRPTN 2 non-null object
153 property_information_SA_CONSTRUCTION_CODE 1 non-null object
154 property_information_SA_HEATING_COOLING 1 non-null object
155 property_information_SA_FIREPLACE_CODE 2 non-null object
156 property_information_SA_GARAGE_CARPORT 1 non-null object
157 property_information_SA_NBR_UNITS 1 non-null float64
158 exterior_information_FOUNDATION_CODE 2 non-null object
159 utility_information_SEWER_CODE 3 non-null object
160 utility_information_WATER_CODE 3 non-null object
dtypes: bool(19), float64(46), int64(6), object(90)
memory usage: 810.0+ KB
This shows that the initial dataset has 718 rows and 161 columns and is roughly 810 kB in size. Both of these values indicate that the dataset has sufficient information for further processing.
In this analysis we can see that there are many columns with large numbers of null values. Additionally, while many features are already float64 values, several object columns will need to be converted to floating point values.
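As a hypothetical illustration of that conversion (the actual cleaning choices are made in Section 4), a string-typed price column stored as object can be coerced to floating point like this:

```python
# Illustrative only: coerce an object-typed price column to float64.
import pandas as pd

s = pd.Series(["$1,250,000", "649900", None], name="price")
cleaned = (
    s.str.replace(r"[$,]", "", regex=True)  # strip currency symbols and commas
     .pipe(pd.to_numeric, errors="coerce")  # non-parsable values become NaN
)
print(cleaned.dtype)  # float64
```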
3.3 Data Filtering Parameters
As we start to query the data we will save our notes on the dataset in filter_params. This dict is meant to capture values that we will use for data cleaning in Section 4. Note that we will use the Python convention of all caps for variables that should not be changed and are global to the file.
filter_params = {}
3.4 Initial Filtering/High Level Cleaning
Before we begin our analysis we must perform some initial data cleaning to remove obvious shortcomings in our listing data.
3.4.1 Filtering Columns
Our first step is the removal of columns with large counts of null values. While some of these columns contain high quality data, we need both high quality and high quantity data to input into our ML models. To filter, we count the number of null values in each column and drop columns whose null value percentage is higher than NULL_MAX_PERCENT:
filter_params["NULL_MAX_PERCENT"] = 0.25
null_counts = df.isnull().sum()
columns_to_drop = null_counts[null_counts / len(df) > filter_params["NULL_MAX_PERCENT"]].index
df_before_columns_count = len(df.columns)
df = df.drop(columns_to_drop, axis=1)
df_after_columns_count = len(df.columns)
print(f"Before Filtering Column Count: {df_before_columns_count}")
print(f"After Filtering Column Count: {df_after_columns_count}")
print(f"Column Filtering Removal Count: {df_before_columns_count - df_after_columns_count}")
Before Filtering Column Count: 161
After Filtering Column Count: 89
Column Filtering Removal Count: 72
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 718 entries, 0 to 717
Data columns (total 89 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 718 non-null int64
1 zpid 718 non-null int64
2 price 718 non-null object
3 priceLabel 718 non-null object
4 beds 714 non-null float64
5 baths 718 non-null float64
6 area 715 non-null float64
7 statusType 718 non-null object
8 statusText 718 non-null object
9 isFavorite 718 non-null bool
10 isUserClaimingOwner 718 non-null bool
11 isUserConfirmedClaim 718 non-null bool
12 imgSrc 718 non-null object
13 hasImage 718 non-null bool
14 visited 718 non-null bool
15 shouldShowZestimateAsPrice 718 non-null bool
16 detailUrl 718 non-null object
17 pgapt 718 non-null object
18 sgapt 718 non-null object
19 has3DModel 718 non-null bool
20 hasVideo 718 non-null bool
21 isHomeRec 718 non-null bool
22 address 718 non-null object
23 info3String 708 non-null object
24 info1String 708 non-null object
25 brokerName 708 non-null object
26 hasAdditionalAttributions 718 non-null bool
27 isFeaturedListing 718 non-null bool
28 isShowcaseListing 718 non-null bool
29 timeOnZillow 718 non-null int64
30 latLong_latitude 718 non-null float64
31 latLong_longitude 718 non-null float64
32 hdpData_homeInfo_zpid 718 non-null int64
33 hdpData_homeInfo_zipcode 718 non-null int64
34 hdpData_homeInfo_city 718 non-null object
35 hdpData_homeInfo_state 718 non-null object
36 hdpData_homeInfo_latitude 718 non-null float64
37 hdpData_homeInfo_longitude 718 non-null float64
38 hdpData_homeInfo_price 718 non-null float64
39 hdpData_homeInfo_bathrooms 718 non-null float64
40 hdpData_homeInfo_bedrooms 714 non-null float64
41 hdpData_homeInfo_livingArea 715 non-null float64
42 hdpData_homeInfo_homeType 718 non-null object
43 hdpData_homeInfo_homeStatus 718 non-null object
44 hdpData_homeInfo_daysOnZillow 718 non-null int64
45 hdpData_homeInfo_isFeatured 718 non-null bool
46 hdpData_homeInfo_shouldHighlight 718 non-null bool
47 hdpData_homeInfo_zestimate 608 non-null float64
48 hdpData_homeInfo_rentZestimate 642 non-null float64
49 hdpData_homeInfo_listing_sub_type_is_FSBA 664 non-null object
50 hdpData_homeInfo_isUnmappable 718 non-null bool
51 hdpData_homeInfo_isPreforeclosureAuction 718 non-null bool
52 hdpData_homeInfo_homeStatusForHDP 718 non-null object
53 hdpData_homeInfo_priceForHDP 718 non-null float64
54 hdpData_homeInfo_isNonOwnerOccupied 718 non-null bool
55 hdpData_homeInfo_isPremierBuilder 718 non-null bool
56 hdpData_homeInfo_isZillowOwned 718 non-null bool
57 hdpData_homeInfo_currency 718 non-null object
58 hdpData_homeInfo_country 718 non-null object
59 hdpData_homeInfo_taxAssessedValue 703 non-null float64
60 hdpData_homeInfo_lotAreaValue 714 non-null float64
61 hdpData_homeInfo_lotAreaUnit 714 non-null object
62 schools_rating 681 non-null float64
63 school_district 681 non-null object
64 year_built 639 non-null float64
65 year_renovated 557 non-null float64
66 sq_ft_finished 637 non-null float64
67 total_sq_ft 639 non-null float64
68 lot_sq_ft 673 non-null float64
69 taxable_land_value 671 non-null float64
70 taxable_improvement_value 641 non-null float64
71 county_name 681 non-null object
72 bathroom_information_FULL_BATHS 559 non-null float64
73 heating_&_cooling_HEATING_TYPE_CODE 625 non-null object
74 exterior_information_BUILDING_QUALITY_CODE 631 non-null object
75 exterior_information_BUILDING_CODE 545 non-null object
76 property_information_SUBDIVISION_NAME 638 non-null object
77 property_information_LIVING_SQUARE_FEET 635 non-null object
78 property_information_GROUND_FLOOR_SQUARE_FEET 623 non-null object
79 property_information_BUILDING_SQUARE_FEET 637 non-null object
80 property_information_LEGAL_DESCRIPTION 671 non-null object
81 parking_&_garage_information_PARKING_TYPE 599 non-null object
82 parking_&_garage_information_GARAGE_PARKING_SQUARE_FEET 574 non-null object
83 parking_&_garage_information_GARAGE_CODE 578 non-null object
84 lot_information_NUMBER_OF_BUILDINGS 638 non-null float64
85 lot_information_LAND_SQUARE_FOOTAGE 671 non-null object
86 lot_information_LEGAL_LOT_NUMBER 629 non-null object
87 lot_information_ACRES 671 non-null float64
88 assessor_information_ASSESSED_YEAR 671 non-null float64
dtypes: bool(19), float64(28), int64(6), object(36)
memory usage: 406.1+ KB
3.4.2 Filtering Price Rows
Next we will drop rows where the price does not exist:
df = df.dropna(subset='hdpData_homeInfo_price')
3.5 Notable Columns
As our dataset has a high number of features, we will focus our efforts on the columns that we expect will provide the highest value to our goal of accurate price prediction.
3.5.1 Target Column, hdpData_homeInfo_price
The target column hdpData_homeInfo_price contains real estate listing prices in US dollars ($), expressed as floating point numbers. If the real estate market is operating efficiently, this number will properly capture the value of all features of the listing. Zillow is the source for this column.
df["hdpData_homeInfo_price"].info()
<class 'pandas.core.series.Series'>
RangeIndex: 718 entries, 0 to 717
Series name: hdpData_homeInfo_price
Non-Null Count Dtype
-------------- -----
718 non-null float64
dtypes: float64(1)
memory usage: 5.7 KB
df['hdpData_homeInfo_price'].describe().apply(lambda x: format(x, 'f'))
count 718.000000
mean 1252138.281337
std 1603089.002523
min 0.000000
25% 649900.000000
50% 840000.000000
75% 1275000.000000
max 23799000.000000
Name: hdpData_homeInfo_price, dtype: object
As we can see from the code sections above, hdpData_homeInfo_price has 718 values that range from $0 to $23,799,000, with a mean of ~$1,250,000 and a median of $840,000. We will explore this column more completely when we clean it in Section 5.2.1.
We do see that the minimum price is $0, which most likely represents a missing value. Let's note this in the filter parameters by setting a minimum price. This value is a best guess based on local knowledge:
filter_params["MINIMUM_PRICE"] = 250_000
3.5.2 Feature Column, hdpData_homeInfo_livingArea
The feature column hdpData_homeInfo_livingArea contains the square footage of the listing's living area (\(\text{ft}^2\)), stored as a float64. The source of this column is Zillow. This measurement typically excludes the garage and unfinished space, so it can effectively be interpreted as the interior size of the listing. There is no guarantee that this measurement is correct, and the seller may have an incentive to inflate this number in an attempt to get a higher price.
df['hdpData_homeInfo_livingArea'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 718 entries, 0 to 717
Series name: hdpData_homeInfo_livingArea
Non-Null Count Dtype
-------------- -----
715 non-null float64
dtypes: float64(1)
memory usage: 5.7 KB
df['hdpData_homeInfo_livingArea'].describe().apply(lambda x: format(x, 'f'))
count 715.000000
mean 3173.609790
std 2624.123234
min 522.000000
25% 1884.500000
50% 2704.000000
75% 3794.500000
max 50275.000000
Name: hdpData_homeInfo_livingArea, dtype: object
As we can see from the code sections above, hdpData_homeInfo_livingArea has 715 values that range from 522 to 50,275, with a mean of 3,173 and a median of 2,704. We will explore this column more completely when we clean it in Section 5.2.3.
We do see an extreme value (~50,000) for the max square footage, which should be added to the filter. This value is most likely an error with data entry or listing categorization:
filter_params["MAX_SQFT"] = 10000
3.5.3 Feature Column, schools_rating
The feature column schools_rating is a rating of schools in the area, known as the GreatSchools Rating. We are including this value as the quality of local schools may have an effect on overall price.
df['schools_rating'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 718 entries, 0 to 717
Series name: schools_rating
Non-Null Count Dtype
-------------- -----
681 non-null float64
dtypes: float64(1)
memory usage: 5.7 KB
df['schools_rating'].describe().apply(lambda x: format(x, 'f'))
count 681.000000
mean 6.225110
std 1.626391
min 2.400000
25% 4.700000
50% 6.500000
75% 7.300000
max 9.400000
Name: schools_rating, dtype: object
As we can see from the code sections above, schools_rating has 681 values that range from 2.4 to 9.4, with a mean of 6.22 and a median of 6.5.
This column seems to contain good quality data, but it will require imputation to fill in missing values. We will add this to our filter parameters:
filter_params["IMPUTE_COLS"] = ['schools_rating']
3.5.4 Feature Column, exterior_information_BUILDING_QUALITY_CODE
The feature column exterior_information_BUILDING_QUALITY_CODE is a categorical value from Redfin that contains string ratings of the building quality. This value lives deep within the Redfin API and no further description is available.
df['exterior_information_BUILDING_QUALITY_CODE'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 718 entries, 0 to 717
Series name: exterior_information_BUILDING_QUALITY_CODE
Non-Null Count Dtype
-------------- -----
631 non-null object
dtypes: object(1)
memory usage: 5.7+ KB
df['exterior_information_BUILDING_QUALITY_CODE'].value_counts()
exterior_information_BUILDING_QUALITY_CODE
Average 331
Good 186
Fair 67
Above Average 32
Excellent 12
Poor 3
Name: count, dtype: int64
As we can see from the code sections above, exterior_information_BUILDING_QUALITY_CODE has 631 non-null values that range from “Poor” to “Excellent”; the remaining 87 listings are missing a rating.
As this is a categorical value it will need to be encoded for further processing. We will add this note to the filter parameters:
filter_params["ENCODE_COLS"] = ['exterior_information_BUILDING_QUALITY_CODE']
4 Data Cleaning
Prior to constructing our ML models, the listing data must be cleaned. The goal of this section is to produce a single dataset that is formatted and ready for ML model construction. This process should build a solid foundation for building different types of models. As a reminder, we have already removed columns with large numbers of null values (Section 3.4.1) and removed rows with null prices (Section 3.4.2).
The data cleaning process will require the following steps:
- Eliminate columns that will not be utilized, or alternatively, initially select feature columns
- Simplify column names
- Check for duplicate columns
- Encode columns that contain categorical values
As a reminder we will print out the filter_params dict:
filter_params
{'NULL_MAX_PERCENT': 0.25,
'MINIMUM_PRICE': 250000,
'MAX_SQFT': 10000,
'IMPUTE_COLS': ['schools_rating'],
'ENCODE_COLS': ['exterior_information_BUILDING_QUALITY_CODE']}
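As a sketch of how these parameters might later be applied (the actual cleaning steps follow in the sections below), consider a small toy DataFrame. The column names price, sqFt, schoolsRating, and quality here are hypothetical stand-ins for the real columns:

```python
# Illustrative only: apply the filter_params ideas to a toy DataFrame.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

toy = pd.DataFrame({
    "price": [150_000, 600_000, 900_000, 1_200_000],
    "sqFt": [900, 2_500, 12_000, 3_100],
    "schoolsRating": [5.0, np.nan, 7.0, 8.0],
    "quality": ["Fair", "Average", "Good", "Excellent"],
})

# Row filters corresponding to MINIMUM_PRICE and MAX_SQFT
toy = toy[(toy["price"] >= 250_000) & (toy["sqFt"] <= 10_000)].copy()

# Fill missing ratings from nearest numeric neighbors (IMPUTE_COLS)
num_cols = ["price", "sqFt", "schoolsRating"]
toy[num_cols] = KNNImputer(n_neighbors=2).fit_transform(toy[num_cols])

# Ordinal-encode the categorical quality column (ENCODE_COLS);
# this ordering is an assumption for illustration
order = {"Poor": 0, "Fair": 1, "Average": 2,
         "Above Average": 3, "Good": 4, "Excellent": 5}
toy["quality"] = toy["quality"].map(order)
print(toy)
```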
4.1 DataFrame Setup
df.index = df['zpid']
len(df.columns)
89
df.columns
Index(['Unnamed: 0', 'zpid', 'price', 'priceLabel', 'beds', 'baths', 'area',
'statusType', 'statusText', 'isFavorite', 'isUserClaimingOwner',
'isUserConfirmedClaim', 'imgSrc', 'hasImage', 'visited',
'shouldShowZestimateAsPrice', 'detailUrl', 'pgapt', 'sgapt',
'has3DModel', 'hasVideo', 'isHomeRec', 'address', 'info3String',
'info1String', 'brokerName', 'hasAdditionalAttributions',
'isFeaturedListing', 'isShowcaseListing', 'timeOnZillow',
'latLong_latitude', 'latLong_longitude', 'hdpData_homeInfo_zpid',
'hdpData_homeInfo_zipcode', 'hdpData_homeInfo_city',
'hdpData_homeInfo_state', 'hdpData_homeInfo_latitude',
'hdpData_homeInfo_longitude', 'hdpData_homeInfo_price',
'hdpData_homeInfo_bathrooms', 'hdpData_homeInfo_bedrooms',
'hdpData_homeInfo_livingArea', 'hdpData_homeInfo_homeType',
'hdpData_homeInfo_homeStatus', 'hdpData_homeInfo_daysOnZillow',
'hdpData_homeInfo_isFeatured', 'hdpData_homeInfo_shouldHighlight',
'hdpData_homeInfo_zestimate', 'hdpData_homeInfo_rentZestimate',
'hdpData_homeInfo_listing_sub_type_is_FSBA',
'hdpData_homeInfo_isUnmappable',
'hdpData_homeInfo_isPreforeclosureAuction',
'hdpData_homeInfo_homeStatusForHDP', 'hdpData_homeInfo_priceForHDP',
'hdpData_homeInfo_isNonOwnerOccupied',
'hdpData_homeInfo_isPremierBuilder', 'hdpData_homeInfo_isZillowOwned',
'hdpData_homeInfo_currency', 'hdpData_homeInfo_country',
'hdpData_homeInfo_taxAssessedValue', 'hdpData_homeInfo_lotAreaValue',
'hdpData_homeInfo_lotAreaUnit', 'schools_rating', 'school_district',
'year_built', 'year_renovated', 'sq_ft_finished', 'total_sq_ft',
'lot_sq_ft', 'taxable_land_value', 'taxable_improvement_value',
'county_name', 'bathroom_information_FULL_BATHS',
'heating_&_cooling_HEATING_TYPE_CODE',
'exterior_information_BUILDING_QUALITY_CODE',
'exterior_information_BUILDING_CODE',
'property_information_SUBDIVISION_NAME',
'property_information_LIVING_SQUARE_FEET',
'property_information_GROUND_FLOOR_SQUARE_FEET',
'property_information_BUILDING_SQUARE_FEET',
'property_information_LEGAL_DESCRIPTION',
'parking_&_garage_information_PARKING_TYPE',
'parking_&_garage_information_GARAGE_PARKING_SQUARE_FEET',
'parking_&_garage_information_GARAGE_CODE',
'lot_information_NUMBER_OF_BUILDINGS',
'lot_information_LAND_SQUARE_FOOTAGE',
'lot_information_LEGAL_LOT_NUMBER', 'lot_information_ACRES',
'assessor_information_ASSESSED_YEAR'],
dtype='object')
4.2 Down Selecting Feature Columns
In this section we perform an initial selection of features. The intent is to remove features that will not be utilized.
original_df = df.copy()
df = df[
[
"beds",
"baths",
"latLong_latitude",
"latLong_longitude",
"hdpData_homeInfo_price",
"hdpData_homeInfo_livingArea",
"hdpData_homeInfo_zestimate",
"hdpData_homeInfo_taxAssessedValue",
"schools_rating",
"year_built",
"year_renovated",
"sq_ft_finished",
"total_sq_ft",
"lot_sq_ft",
"taxable_land_value",
"taxable_improvement_value",
"heating_&_cooling_HEATING_TYPE_CODE",
"exterior_information_BUILDING_QUALITY_CODE",
"property_information_SUBDIVISION_NAME",
"property_information_GROUND_FLOOR_SQUARE_FEET",
"property_information_BUILDING_SQUARE_FEET",
"property_information_LEGAL_DESCRIPTION",
"parking_&_garage_information_PARKING_TYPE",
"parking_&_garage_information_GARAGE_PARKING_SQUARE_FEET",
"parking_&_garage_information_GARAGE_CODE",
"lot_information_LAND_SQUARE_FOOTAGE",
]
]
len(df.columns)
26
4.3 Simplifying Column Names
prefixes_to_replace = [
"hdpData_homeInfo_",
"latLong_",
"heating_&_cooling_",
"exterior_information_",
"property_information_",
"parking_&_garage_information_",
"lot_information",
]
for prefix in prefixes_to_replace:
df.columns = df.columns.str.replace(prefix, "")
def camel_case(input_string):
"""
Convert a string to camel case.
Parameters
----------
input_string : str
The input string to be converted.
Returns
-------
str
The converted string in camel case.
Notes
-----
If the input string does not contain any underscores, it is returned as is.
The conversion process involves converting the input string to title case and removing non-alphanumeric characters.
Then, the first character is converted to lowercase.
Examples
--------
>>> camel_case("hello_world")
'helloWorld'
>>> camel_case("hello123_world")
'hello123World'
"""
if input_string.count('_') == 0:
return input_string
# Convert the input string to title case and remove non-alphanumeric characters
output = ''.join(x for x in input_string.title() if x.isalnum())
# Convert the first character to lowercase
return output[0].lower() + output[1:]
df.columns = [camel_case(x) for x in df.columns]
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 718 entries, 13189295 to 13803132
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 beds 714 non-null float64
1 baths 718 non-null float64
2 latitude 718 non-null float64
3 longitude 718 non-null float64
4 price 718 non-null float64
5 livingArea 715 non-null float64
6 zestimate 608 non-null float64
7 taxAssessedValue 703 non-null float64
8 schoolsRating 681 non-null float64
9 yearBuilt 639 non-null float64
10 yearRenovated 557 non-null float64
11 sqFtFinished 637 non-null float64
12 totalSqFt 639 non-null float64
13 lotSqFt 673 non-null float64
14 taxableLandValue 671 non-null float64
15 taxableImprovementValue 641 non-null float64
16 heatingTypeCode 625 non-null object
17 buildingQualityCode 631 non-null object
18 subdivisionName 638 non-null object
19 groundFloorSquareFeet 623 non-null object
20 buildingSquareFeet 637 non-null object
21 legalDescription 671 non-null object
22 parkingType 599 non-null object
23 garageParkingSquareFeet 574 non-null object
24 garageCode 578 non-null object
25 landSquareFootage 671 non-null object
dtypes: float64(16), object(10)
memory usage: 151.5+ KB
4.4 Column Conversion
There are a few columns that look numeric but are stored as object. Here we will attempt to convert their type; if a conversion fails we will drop the column:
columns_to_convert_to_numeric = ["groundFloorSquareFeet", "buildingSquareFeet",
"garageParkingSquareFeet", "landSquareFootage"]
for col in columns_to_convert_to_numeric:
df[col] = pd.to_numeric(df[col].str.replace(",", ""))
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 718 entries, 13189295 to 13803132
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 beds 714 non-null float64
1 baths 718 non-null float64
2 latitude 718 non-null float64
3 longitude 718 non-null float64
4 price 718 non-null float64
5 livingArea 715 non-null float64
6 zestimate 608 non-null float64
7 taxAssessedValue 703 non-null float64
8 schoolsRating 681 non-null float64
9 yearBuilt 639 non-null float64
10 yearRenovated 557 non-null float64
11 sqFtFinished 637 non-null float64
12 totalSqFt 639 non-null float64
13 lotSqFt 673 non-null float64
14 taxableLandValue 671 non-null float64
15 taxableImprovementValue 641 non-null float64
16 heatingTypeCode 625 non-null object
17 buildingQualityCode 631 non-null object
18 subdivisionName 638 non-null object
19 groundFloorSquareFeet 623 non-null float64
20 buildingSquareFeet 637 non-null float64
21 legalDescription 671 non-null object
22 parkingType 599 non-null object
23 garageParkingSquareFeet 574 non-null float64
24 garageCode 578 non-null object
25 landSquareFootage 671 non-null float64
dtypes: float64(20), object(6)
memory usage: 151.5+ KB
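The conversion above relies on stripping thousands separators before calling pd.to_numeric. A minimal sketch on a toy series of hypothetical values:

```python
import pandas as pd

# Toy series of comma-formatted square-footage strings (not the listing data)
raw = pd.Series(["1,250", "980", "2,400"])
converted = pd.to_numeric(raw.str.replace(",", ""))
print(converted.tolist())  # [1250, 980, 2400]
```

If a column contained values that cannot be parsed, pd.to_numeric would raise a ValueError by default; passing errors="coerce" would turn such values into NaN instead.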
4.5 Checking for Collinearity
To check for duplicate information we will set a correlation threshold, compute the correlation of each numeric column with every other column, and save the column pairs above the threshold:
def calc_collinearity(input_df, correlation_threshold=0.9):
"""
Calculate collinearity between columns in a DataFrame.
Parameters:
input_df (pandas.DataFrame): The input DataFrame.
correlation_threshold (float, optional): The threshold value for determining correlation. Default is 0.9.
Returns:
list: A list of correlated column pairs.
Example:
>>> test_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
>>> calc_collinearity(test_df)
[]
Note that only float64 columns are checked, so the integer columns above are skipped.
"""
correlated_pairs = []
for col in input_df.columns:
if input_df[col].dtype == "float64":
corr = input_df.corrwith(input_df[col], numeric_only=True).sort_values(ascending=False)[1:]
for key, val in corr.to_dict().items():
if val > correlation_threshold:
correlated_pairs.append([col, key])
return correlated_pairs
print(calc_collinearity(df))
[['price', 'zestimate'], ['livingArea', 'buildingSquareFeet'], ['zestimate', 'price'], ['sqFtFinished', 'totalSqFt'], ['totalSqFt', 'sqFtFinished'], ['lotSqFt', 'lotSqFt'], ['buildingSquareFeet', 'livingArea'], ['landSquareFootage', 'lotSqFt']]
test_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(calc_collinearity(test_df))
[]
From this analysis we can see that many features are collinear. We are going to compare our models against both price and the zestimate (Zillow's price estimate), so we will keep those two. In the following code section we remove one feature from each remaining collinear pair:
df = df.drop(['sqFtFinished', 'buildingSquareFeet', "landSquareFootage"], axis=1)
calc_collinearity(df)
[['price', 'zestimate'], ['zestimate', 'price']]
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 718 entries, 13189295 to 13803132
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 beds 714 non-null float64
1 baths 718 non-null float64
2 latitude 718 non-null float64
3 longitude 718 non-null float64
4 price 718 non-null float64
5 livingArea 715 non-null float64
6 zestimate 608 non-null float64
7 taxAssessedValue 703 non-null float64
8 schoolsRating 681 non-null float64
9 yearBuilt 639 non-null float64
10 yearRenovated 557 non-null float64
11 totalSqFt 639 non-null float64
12 lotSqFt 673 non-null float64
13 taxableLandValue 671 non-null float64
14 taxableImprovementValue 641 non-null float64
15 heatingTypeCode 625 non-null object
16 buildingQualityCode 631 non-null object
17 subdivisionName 638 non-null object
18 groundFloorSquareFeet 623 non-null float64
19 legalDescription 671 non-null object
20 parkingType 599 non-null object
21 garageParkingSquareFeet 574 non-null float64
22 garageCode 578 non-null object
dtypes: float64(17), object(6)
memory usage: 134.6+ KB
4.6 Filtering Categorical Columns
Prior to encoding categorical columns we would like to check their contents and quality. We prefer categorical columns with a small number of distinct values. First we print the value counts of each column:
for col in df.columns:
if df[col].dtype != "float64":
print(col)
print(df[col].value_counts())
print()
heatingTypeCode
heatingTypeCode
Forced Air 450
Warm Air 81
Hot Water 60
Wall Furnace 6
Baseboard 6
Forced Air Gas 5
Radiant 4
Electric 3
Baseboard Hot Water 3
Gravity 2
Furnace 2
Floor/wall Furnace 1
Baseboard Electric 1
Baseboard Electric/Hot Water 1
Name: count, dtype: int64
buildingQualityCode
buildingQualityCode
Average 331
Good 186
Fair 67
Above Average 32
Excellent 12
Poor 3
Name: count, dtype: int64
subdivisionName
subdivisionName
HASKINS STATION 8
CANDELAS FLG 1 8
LEYDEN ROCK SUB FLG 6 7
P T BARNUMS SUB 6
BELLEVIEW & SIMMS FLG 1 5
..
BUSCARELLO LT LINE ADJ 1
LINDA VISTA ACRES 1
CANDELAS FLG 2 AMD 1 1
APPLEWOOD GROVE 1ST FLG 1
TROUTDALE IN THE PINES 1
Name: count, Length: 491, dtype: int64
legalDescription
legalDescription
LOT 8 BLK 6 KEEWAYDIN 1
SECTION 13 TOWNSHIP 04 RANGE 69 QTR SE SUBDIVISIONCD 396200 SUBDIVISIONNAME KELTON HEIGHTS BLOCK 037 LOT SIZE: 36994 TRACT VALUE: .849 1
P T BARNUMS SUB B98 L38 TO 40 INC 1
SECTION 17 TOWNSHIP 04 RANGE 69 SUBDIVISIONCD 308800 SUBDIVISIONNAME GREEN MOUNTAIN VILLAGE FLG # 2 BLOCK 011 LOT 0004 SIZE: 9990 TRACT VALUE: .229 1
SUB:BERKELEY GARDENS BLK:7 DESC: LOTS 25 TO 29 INC 1
..
SECTION 34 TOWNSHIP 03 RANGE 69 QTR NE SUBDIVISIONCD 394800 SUBDIVISIONNAME KAWANEE GARDENS BLOCK 007 LOT 0006 SIZE: 17194 TRACT 00A VALUE: .395 1
SECTION 25 TOWNSHIP 03 RANGE 69 SUBDIVISIONCD 102705 SUBDIVISIONNAME BUSCARELLO LOT LINE ADJ LOT 0002 SIZE: 5815 VALUE: .1335 1
SECTION 28 TOWNSHIP 02 RANGE 69 QTR NW SUBDIVISIONCD 435900 SUBDIVISIONNAME LAKECREST FLG #3 BLOCK 001 LOT 0001 SIZE: 9583 TRACT VALUE: .220 1
SECTION 01 TOWNSHIP 03 RANGE 69 QTR NW SUBDIVISIONCD 454800 SUBDIVISIONNAME LAMAR HEIGHTS FLG # 8 BLOCK LOT 0190 SIZE: 7618 TRACT VALUE: .175 1
SECTION 10 TOWNSHIP 05 RANGE 71 QTR NE SIZE: 21126 VALUE: .485 KEY=107 1
Name: count, Length: 671, dtype: int64
parkingType
parkingType
Attached Garage 428
Detached Garage 62
Detached Frame Garage 24
Basement Garage 22
Detached Masonry Garage 20
Carport 12
On Street 11
Built Under Garage 10
On and Off Street 10
Name: count, dtype: int64
garageCode
garageCode
Attached 428
Detached 62
Detached Frame 24
Basement 22
Detached Masonry 20
Carport 12
Built Under 10
Name: count, dtype: int64
Based on the output we will eliminate legalDescription, which contains all unique values, and parkingType, which duplicates the information in garageCode.
df = df.drop(['legalDescription', 'parkingType'], axis=1)
subdivisionName requires further analysis:
df['subdivisionName'].value_counts()
subdivisionName
HASKINS STATION 8
CANDELAS FLG 1 8
LEYDEN ROCK SUB FLG 6 7
P T BARNUMS SUB 6
BELLEVIEW & SIMMS FLG 1 5
..
BUSCARELLO LT LINE ADJ 1
LINDA VISTA ACRES 1
CANDELAS FLG 2 AMD 1 1
APPLEWOOD GROVE 1ST FLG 1
TROUTDALE IN THE PINES 1
Name: count, Length: 491, dtype: int64
Based on this output subdivisionName has too many values for conversion to a category, so we will remove it as well:
df = df.drop(['subdivisionName'], axis=1)
We now have 3 categorical columns that can be encoded.
4.7 Filtering Unrealistic Prices
Next we will remove listings where the price is unrealistic. This is based on the analysis in Section 3.5.1:
df = df[df['price'] > filter_params["MINIMUM_PRICE"]]
4.8 Filtering Square Footage Outliers
Next we will remove listings that are larger than expected. This was first analyzed in Section 3.5.2:
df = df[df['livingArea'] < filter_params["MAX_SQFT"]]
Additionally we will filter out properties on lots greater than 10 acres (435,600 square feet). These are typically farm or agricultural properties and are not needed in our pricing model.
filter_params['MAX_LOT_SQFT'] = 10 * 43560
df = df[df["lotSqFt"] < filter_params["MAX_LOT_SQFT"]]4.9 Correlation Visualization
To continue to understand the data we will visualize the relationship between columns:
df_float = df.select_dtypes(include=['float64'])
def plot_correlation(input_df, title, annot=True, tick_rot=0, width=12, height=8):
"""
This function plots a correlation heatmap for a given input dataframe.
Args:
input_df (pandas.DataFrame): The input dataframe containing the data for which the correlation heatmap needs to be plotted.
title (str): The title for the correlation heatmap plot.
annot (bool, optional): Whether to annotate the heatmap cells with the correlation values. Defaults to True.
tick_rot (int, optional): The rotation angle for the x-axis tick labels. Defaults to 0.
width (int, optional): The width of the correlation heatmap plot. Defaults to 12.
height (int, optional): The height of the correlation heatmap plot. Defaults to 8.
Returns:
None
"""
# Calculate the correlation matrix
corr = input_df.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(width, height))
sns.heatmap(
corr,
cmap="vlag",
annot=annot,
xticklabels=corr.columns,
yticklabels=corr.columns,
vmin=-1.0,
vmax=1.0,
fmt=".2f",
cbar=False,
).set(title=title)
# Move x-axis ticks to the top
ax.xaxis.tick_top()
# Remove tick marks
ax.tick_params(length=0)
plt.xticks(rotation=tick_rot)
plt.yticks(rotation=0)
brand_plot()
plt.show()
plot_correlation(df, "Initial Correlation Matrix", annot=False, tick_rot=45)
In Figure 3 we see a few strong positive correlations and a large number of weak correlations. Overall this visualization reinforces that our data is ready for further work.
4.10 One Hot Encoding (OHE)
One Hot Encoding (OHE) is the process of converting a categorical value into a set of binary columns, one per category, with a 1 in the column corresponding to each row's category. Because we pass drop_first=True below, a categorical column with 5 values becomes 4 binary columns: the dropped first category is represented by all zeros, which avoids introducing perfectly collinear columns.
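A minimal sketch of this encoding on a toy heating column (not the project data), using the same drop_first=True option as the project code:

```python
import pandas as pd

# Toy categorical column with three levels (hypothetical values)
toy = pd.DataFrame({"heating": ["Forced Air", "Radiant", "Hot Water"]})
ohe = pd.get_dummies(toy["heating"], prefix="heating", drop_first=True)
# The alphabetically first level ("Forced Air") is dropped; rows of that
# level are encoded as all zeros in the remaining columns
print(list(ohe.columns))  # ['heating_Hot Water', 'heating_Radiant']
```

Note that get_dummies sorts category levels alphabetically before dropping the first one.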
Prior to this conversion it will be helpful to save the column names that are not one hot encoded:
float_cols = [col for col in df.columns if df[col].dtype == 'float64']
In the code below we use the pandas get_dummies function to OHE our categorical features:
for col in df.columns:
if df[col].dtype != "float64":
one_hot = pd.get_dummies(df[col], prefix=col, drop_first=True)
df = df.drop(col, axis=1)
df = pd.concat([df, one_hot], axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 658 entries, 13189295 to 13803132
Data columns (total 41 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 beds 656 non-null float64
1 baths 658 non-null float64
2 latitude 658 non-null float64
3 longitude 658 non-null float64
4 price 658 non-null float64
5 livingArea 658 non-null float64
6 zestimate 565 non-null float64
7 taxAssessedValue 652 non-null float64
8 schoolsRating 658 non-null float64
9 yearBuilt 625 non-null float64
10 yearRenovated 544 non-null float64
11 totalSqFt 624 non-null float64
12 lotSqFt 658 non-null float64
13 taxableLandValue 656 non-null float64
14 taxableImprovementValue 626 non-null float64
15 groundFloorSquareFeet 610 non-null float64
16 garageParkingSquareFeet 560 non-null float64
17 heatingTypeCode_Baseboard Electric 658 non-null bool
18 heatingTypeCode_Baseboard Electric/Hot Water 658 non-null bool
19 heatingTypeCode_Baseboard Hot Water 658 non-null bool
20 heatingTypeCode_Electric 658 non-null bool
21 heatingTypeCode_Floor/wall Furnace 658 non-null bool
22 heatingTypeCode_Forced Air 658 non-null bool
23 heatingTypeCode_Forced Air Gas 658 non-null bool
24 heatingTypeCode_Furnace 658 non-null bool
25 heatingTypeCode_Gravity 658 non-null bool
26 heatingTypeCode_Hot Water 658 non-null bool
27 heatingTypeCode_Radiant 658 non-null bool
28 heatingTypeCode_Wall Furnace 658 non-null bool
29 heatingTypeCode_Warm Air 658 non-null bool
30 buildingQualityCode_Average 658 non-null bool
31 buildingQualityCode_Excellent 658 non-null bool
32 buildingQualityCode_Fair 658 non-null bool
33 buildingQualityCode_Good 658 non-null bool
34 buildingQualityCode_Poor 658 non-null bool
35 garageCode_Basement 658 non-null bool
36 garageCode_Built Under 658 non-null bool
37 garageCode_Carport 658 non-null bool
38 garageCode_Detached 658 non-null bool
39 garageCode_Detached Frame 658 non-null bool
40 garageCode_Detached Masonry 658 non-null bool
dtypes: bool(24), float64(17)
memory usage: 108.0 KB
plot_correlation(df, "OHE Listing Correlation Matrix", annot=False, tick_rot=90, width=16, height=10)Figure 4 includes the correlation relationships of the OHE values. We speculate that may add a small amount of value to to our standard regression models, and may be moderately beneficial for our random forest models.
4.11 Imputation
To ensure a fair comparison between models, we will pass identical data to each model. Some models have distinct requirements for their input values, and some cannot handle missing values. This requires that an imputation be performed on the data. For this we will use the KNNImputer from Pedregosa et al. (2011).
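A minimal sketch of how KNNImputer fills a missing value from its nearest rows, on toy data rather than the listing set:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame (hypothetical values) with one missing entry in column "a"
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
# The NaN is replaced by the mean of "a" from the two rows nearest in "b"
print(int(filled["a"].isna().sum()))  # 0
```

The default KNNImputer uses 5 neighbors with uniform weights; distances are computed over the non-missing features only.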
def impute_df(input_df):
"""
Imputes missing values in a DataFrame using KNNImputer.
Args:
input_df (pandas.DataFrame): The DataFrame to impute.
Returns:
pandas.DataFrame: The DataFrame with imputed values.
"""
imputer = KNNImputer()
columns = input_df.columns
input_df = pd.DataFrame(imputer.fit_transform(input_df))
input_df.columns = columns
return input_df
df = impute_df(df)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 658 entries, 0 to 657
Data columns (total 41 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 beds 658 non-null float64
1 baths 658 non-null float64
2 latitude 658 non-null float64
3 longitude 658 non-null float64
4 price 658 non-null float64
5 livingArea 658 non-null float64
6 zestimate 658 non-null float64
7 taxAssessedValue 658 non-null float64
8 schoolsRating 658 non-null float64
9 yearBuilt 658 non-null float64
10 yearRenovated 658 non-null float64
11 totalSqFt 658 non-null float64
12 lotSqFt 658 non-null float64
13 taxableLandValue 658 non-null float64
14 taxableImprovementValue 658 non-null float64
15 groundFloorSquareFeet 658 non-null float64
16 garageParkingSquareFeet 658 non-null float64
17 heatingTypeCode_Baseboard Electric 658 non-null float64
18 heatingTypeCode_Baseboard Electric/Hot Water 658 non-null float64
19 heatingTypeCode_Baseboard Hot Water 658 non-null float64
20 heatingTypeCode_Electric 658 non-null float64
21 heatingTypeCode_Floor/wall Furnace 658 non-null float64
22 heatingTypeCode_Forced Air 658 non-null float64
23 heatingTypeCode_Forced Air Gas 658 non-null float64
24 heatingTypeCode_Furnace 658 non-null float64
25 heatingTypeCode_Gravity 658 non-null float64
26 heatingTypeCode_Hot Water 658 non-null float64
27 heatingTypeCode_Radiant 658 non-null float64
28 heatingTypeCode_Wall Furnace 658 non-null float64
29 heatingTypeCode_Warm Air 658 non-null float64
30 buildingQualityCode_Average 658 non-null float64
31 buildingQualityCode_Excellent 658 non-null float64
32 buildingQualityCode_Fair 658 non-null float64
33 buildingQualityCode_Good 658 non-null float64
34 buildingQualityCode_Poor 658 non-null float64
35 garageCode_Basement 658 non-null float64
36 garageCode_Built Under 658 non-null float64
37 garageCode_Carport 658 non-null float64
38 garageCode_Detached 658 non-null float64
39 garageCode_Detached Frame 658 non-null float64
40 garageCode_Detached Masonry 658 non-null float64
dtypes: float64(41)
memory usage: 210.9 KB
4.12 Building Training and Test DataFrames
Now that we have our data frames we can split them into training and test sets. Our target is going to be price and all other columns are going to be our features. We are using train_test_split to partition the data.
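One detail worth noting when unpacking many arrays at once: train_test_split returns a (train, test) pair for each array passed, in that order, and applies the same shuffle to every array so rows stay aligned. A small sketch with toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(10)
b = a * 100  # paired array; rows must stay aligned with a
a_train, a_test, b_train, b_test = train_test_split(
    a, b, test_size=0.2, random_state=42
)
print(len(a_train), len(a_test))  # 8 2
print(bool(np.all(b_train == a_train * 100)))  # True
```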
test_size = 0.2
random_state = 42
target = "price"
df = df.sort_index()
y = df[target]
z = df["zestimate"]
lat = df["latitude"]
lng = df["longitude"]
# x = df.drop(["price", "zestimate", "latitude", "longitude"], axis=1)
x = df.drop(["price", "zestimate"], axis=1)
x_focus = df.drop(["price", "zestimate", "latitude", "longitude"], axis=1)
x_train, x_test, y_train, y_test, z_train, z_test, lat_train, lat_test, lng_train, lng_test, all_train, all_test, x_focus_train, x_focus_test = train_test_split(
    x, y, z, lat, lng, df, x_focus, test_size=test_size, random_state=random_state
)
4.13 Preparation for Model Comparison
To compare models we need to compute metrics for comparison. We have chosen to track the root mean squared error (RMSE), \(R^2\), and execution time. As we are using the same dataset for all models we can safely use \(R^2\) as a comparison. The code below calculates these values and saves them in model_stats for comparison and visualization.
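A worked toy example of these metrics (hypothetical values, not model output):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])
mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot
print(mse, rmse)      # 100.0 10.0
print(round(r2, 3))   # 0.985
```

RMSE is useful because it is expressed in dollars, the same units as price, while \(R^2\) is unitless and comparable across models fit on the same data.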
execution_times = {}
class ModelMetricsHandler:
"""
A class to store and plot model accuracy statistics.
"""
def __init__(self):
self.model_stats = {
"$R^2$": [],
"Root Mean Squared Error": [],
"Execution Time": [],
}
self.execution_times = {}
def calculate(self, name, i_y_test, i_y_pred, start_time):
"""
Calculate model statistics.
Parameters
----------
name : str
The name of the model.
i_y_test : array-like
The test data.
i_y_pred : array-like
The predicted data.
start_time : number
start time from time.time()
"""
assert len(i_y_test) == len(
i_y_pred
), "Test and prediction array lengths do not match!"
self.model_stats["Execution Time"].append([name, time.time() - start_time])
calc_mean_squared_error = mean_squared_error(i_y_test, i_y_pred)
calc_root_mean_squared_error = mean_squared_error(
i_y_test, i_y_pred, squared=False
)
calc_mean_absolute_percentage_error = mean_absolute_percentage_error(
i_y_test, i_y_pred
)
calc_r2_score = r2_score(i_y_test, i_y_pred)
self.model_stats["Root Mean Squared Error"].append(
[name, calc_root_mean_squared_error]
)
self.model_stats["$R^2$"].append([name, calc_r2_score])
def plot_all(self, fig_width=12, fig_height=12):
"""
Plot model statistics.
Parameters
----------
fig_width : int
The width of the figure.
fig_height : int
The height of the figure.
"""
num_rows = len(self.model_stats.keys())
fig_height = num_rows * 6
num_cols = 1
fig, axes = plt.subplots(
nrows=num_rows, ncols=num_cols, figsize=(fig_width, fig_height)
)
iteration = 1
for metric, values in self.model_stats.items():
labels = [x[0] for x in values]
values = [x[1] for x in values]
plt.subplot(num_rows, num_cols, iteration)
p = plt.bar(labels, values)
plt.title(f"{metric} by Model")
if metric == "$R^2$":
plt.bar_label(p, ["{:.4f}".format(x) for x in values])
else:
plt.bar_label(p, ["{:,}".format(int(x)) for x in values])
plt.xticks(rotation=-45)
iteration += 1
brand_plot()
plt.show()
model_metrics = ModelMetricsHandler()
5 Exploratory Data Analysis (EDA)
5.1 Feature Importance
To start our EDA, we must build our knowledge of the available features. One way to visualize the importance of the individual features is to use a random forest algorithm that supports plotting the importance. One such library is XGBoost:
xgb_model = xgb.XGBRegressor(n_jobs=1, booster="gbtree").fit(x_focus_train, y_train)
y_pred = xgb_model.predict(x_focus_test)
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
xgb.plot_importance(xgb_model, title="Feature Importance", ax=ax)
brand_plot()
plt.show()
XGBoost
Figure 5 illustrates the importance of taxAssessedValue and livingArea, which score the highest. Other floating point features also perform well, and most binary features score relatively low. Interestingly, beds scored much higher than baths. For now we will keep all these features and perform analysis on the most important columns of data.
5.2 Column Exploration
In this section we will describe and query columns that are immediately relevant to our goal of predicting real estate listing prices.
5.2.1 hdpData_homeInfo_price
This column contains the target variable, price, in floating point format. The following code section explores this column:
sns.kdeplot(data=df, x="price", bw_adjust=0.2)
sns.rugplot(data=df, x="price")
plt.title("Real Estate Listing - Raw Price Distribution")
brand_plot()
hdpData_homeInfo_price column visualized in a KDE plot
In Figure 6 we can see that price is heavily concentrated in the bottom fifth of the plot and that many outliers skew the distribution. In the code below we calculate statistical descriptions of this column:
df['price'].describe().apply(lambda x: format(x, 'f'))
count 658.000000
mean 1128054.920973
std 974527.892718
min 291000.000000
25% 646583.500000
50% 825000.000000
75% 1200000.000000
max 12995000.000000
Name: price, dtype: object
Now let's zoom in on the data and visualize the prices between the 25% and 75% quantiles:
q1, q3 = df['price'].quantile([0.25, 0.75])
price_zoomed_df = df[df['price'].between(q1, q3)]
sns.kdeplot(data=price_zoomed_df, x="price", bw_adjust=0.2)
sns.rugplot(data=price_zoomed_df, x="price")
plt.title(f"Real Estate Listing - {q1} - {q3} Price Distribution")
brand_plot()
price distribution of quantile 1 - quantile 3
In Figure 7 we visualize the central range of the pricing data. Here we can clearly see that the prices are not evenly distributed; prices spike near round $100,000 increments across the range.
5.2.2 taxAssessedValue
This column contains the feature taxAssessedValue. This feature is highly correlated with price and, as seen in Section 5.1, may be the primary feature for our regression models:
sns.kdeplot(data=df, x="taxAssessedValue", bw_adjust=0.2)
sns.rugplot(data=df, x="taxAssessedValue")
plt.title("Real Estate Listing - Raw taxAssessedValue Distribution")
brand_plot()
taxAssessedValue feature visualized in a KDE plot
In Figure 8 we see a distribution similar to price with a high peak in the lower range and a long upper tail.
df['taxAssessedValue'].describe().apply(lambda x: format(x, 'f'))
count 658.000000
mean 648787.675380
std 550837.345364
min 14120.000000
25% 411342.500000
50% 518617.000000
75% 716900.000000
max 9299100.000000
Name: taxAssessedValue, dtype: object
Let’s visualize the central section of the density data:
q1, q3 = df['taxAssessedValue'].quantile([0.25, 0.75])
price_zoomed_df = df[df['taxAssessedValue'].between(q1, q3)]
sns.kdeplot(data=price_zoomed_df, x="taxAssessedValue", bw_adjust=0.2)
sns.rugplot(data=price_zoomed_df, x="taxAssessedValue")
plt.title(f"Real Estate Listing - {q1} - {q3} taxAssessedValue Distribution")
brand_plot()
taxAssessedValue distribution of quantile 1 - quantile 3
In Figure 9 we see fewer peaks at the rounded numbers and a smoother distribution. Notice that the range of this chart is $400,000 to $700,000 vs. $600,000 to $1,200,000 for price.
5.2.3 livingArea
This column contains the feature livingArea. This feature is highly correlated with price and may be another primary feature for our regression models. The following code section explores this column:
sns.kdeplot(data=df, x="livingArea", bw_adjust=0.2)
sns.rugplot(data=df, x="livingArea")
plt.title("Real Estate Listing - Raw livingArea Distribution")
brand_plot()
livingArea feature visualized in a KDE plot
In Figure 10 we find that most listings are between 1,500 and 3,500 square feet. Just like price above, there are many outliers that skew the distribution.
df['livingArea'].describe().apply(lambda x: format(x, 'f'))
count 658.000000
mean 2969.103343
std 1494.572898
min 522.000000
25% 1896.500000
50% 2678.000000
75% 3725.750000
max 9971.000000
Name: livingArea, dtype: object
Now let's zoom in on the data and visualize the square footage between the 25% and 75% quantiles:
q1, q3 = df['livingArea'].quantile([0.25, 0.75])
price_zoomed_df = df[df['livingArea'].between(q1, q3)]
sns.kdeplot(data=price_zoomed_df, x="livingArea", bw_adjust=0.2)
sns.rugplot(data=price_zoomed_df, x="livingArea")
plt.title(f"Real Estate Listing - {q1} - {q3} livingArea Distribution")
brand_plot()
livingArea distribution of quantile 1 - quantile 3
In Figure 11 we find no discernible patterns in the density. As expected, there appears to be an even distribution of values in the middle of the feature.
5.3 Clustering Visualization
For our unsupervised ML models our plan is to cluster the data, then perform a regression on the clustered data. But how do we choose our clusters? One option is to iterate over many possible cluster counts and analyze the quality of the clusters using the silhouette score. In the code section below we build a ClusterOptimizer class whose optimize_clusters method takes a DataFrame, two column names, and a clustering model, then iterates over a range of cluster counts, calculating the silhouette score at each step and saving the highest score:
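Before the full class, a minimal sketch of the silhouette score itself on synthetic blobs (not the listing coordinates): scores near 1 indicate tight, well-separated clusters, while scores near 0 indicate overlapping clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs of 20 points each
rng = np.random.default_rng(42)
coords = np.vstack([
    rng.normal(0.0, 0.1, (20, 2)),
    rng.normal(5.0, 0.1, (20, 2)),
])
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(coords)
score = silhouette_score(coords, labels)
print(score > 0.9)  # True: tight, well-separated clusters score near 1
```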
class ClusterOptimizer:
"""
A class to optimize the number of clusters in a clustering model using silhouette scores.
Attributes:
max_all_silhouette_scores (float): The maximum silhouette score obtained across all iterations.
max_all_silhouette_scores_pair (list): The pair of columns (x_col, y_col) that resulted in the maximum silhouette score.
max_all_silhouette_scores_n_clusters (int): The number of clusters that resulted in the maximum silhouette score.
max_all_model: The clustering model that achieved the maximum silhouette score.
Methods:
optimize_clusters: Optimizes the number of clusters in the model and plots the silhouette scores.
"""
def __init__(self):
self.clear_scores()
def clear_scores(self):
self.max_all_silhouette_scores = -1
self.max_all_silhouette_scores_pair = None
self.max_all_silhouette_scores_n_clusters = 0
self.max_all_model = None
def optimize_clusters(self, input_df, x_col, y_col, model, model_label, min_clusters=3, max_clusters=20):
"""
Optimizes the number of clusters in the given model using silhouette scores.
Args:
input_df (DataFrame): The input DataFrame containing the data to cluster.
x_col (str): The column name for the x-coordinate.
y_col (str): The column name for the y-coordinate.
model (sklearn.base.ClusterMixin): The clustering model to use.
model_label (str): The label for the clustering model.
min_clusters (int): The minimum number of clusters to consider (default: 3).
max_clusters (int): The maximum number of clusters to consider (default: 20).
"""
self.clear_scores()
input_df = input_df.copy()
lat = input_df[x_col].to_numpy()
lng = input_df[y_col].to_numpy()
coords = np.dstack((lat, lng))[0]
silhouette_scores = []
max_silhouette_score = -1
max_silhouette_score_cluster_num = 0
for i in range(min_clusters, max_clusters):
cluster_model = model
cluster_model.n_clusters = i
cluster_model.fit(coords)
this_score = silhouette_score(coords, cluster_model.labels_)
silhouette_scores.append(this_score)
if this_score > max_silhouette_score:
max_silhouette_score = this_score
max_silhouette_score_cluster_num = i
plt.title("Coords Silhouette Score vs n_clusters")
plt.plot(range(min_clusters, max_clusters), silhouette_scores)
brand_plot()
plt.show()
# Update global maximum silhouette score and associated values
if max_silhouette_score > self.max_all_silhouette_scores:
self.max_all_silhouette_scores = max_silhouette_score
self.max_all_silhouette_scores_pair = [x_col, y_col]
self.max_all_silhouette_scores_n_clusters = max_silhouette_score_cluster_num
self.max_all_model = model
# Assign cluster labels to the input DataFrame
best_cluster_model = model
model.n_clusters = max_silhouette_score_cluster_num
best_cluster_model.fit(coords)
input_df["location_cluster"] = best_cluster_model.labels_
# Plot the clustered data
sns.scatterplot(data=input_df, x=x_col, y=y_col, hue='location_cluster', palette="tab10")
plt.legend([], [], frameon=False)
plt.title(f"{model_label} - Best Cluster {x_col} vs. {y_col}\nSS: {max_silhouette_score}, N_Clusters: {max_silhouette_score_cluster_num}")
brand_plot()
plt.show()
cluster_optimizer = ClusterOptimizer()
cluster_optimizer.optimize_clusters(df, "longitude", "latitude", KMeans(init="k-means++", n_init="auto", random_state=42), "KMeans")
6 Machine Learning Models
Our ML models will use a general architecture for price prediction. We will use an unsupervised ML algorithm to create subsets of the data, and pass each subset to multiple supervised regression algorithms. The unsupervised ML models will use clustering (KMeans and AgglomerativeClustering) and principal component analysis (PCA) to subset the data. The supervised ML models will use the regression models LinearRegression, AdaBoostRegressor, and XGBRegressor. We will use both root mean squared error (RMSE) and \(R^2\) to capture the results from these models. Results for these models will be shared in Section 7.
6.1 Baseline Single Feature Linear Regression
Our first model will be a baseline comparison of vanilla linear regression. We are using the scikit-learn implementation and passing in the “most important” feature taxAssessedValue:
start = time.time()
linear_regression = LinearRegression()
linear_regression.fit(x_train["taxAssessedValue"].values.reshape(-1, 1), y_train)
y_pred = linear_regression.predict(x_test["taxAssessedValue"].values.reshape(-1, 1))
# calc_model_stats(f"LinearRegression", y_test, y_pred)
model_metrics.calculate("LinearRegression", y_test, y_pred, start)
execution_times["LinearRegression"] = {"duration": time.time() - start}
6.2 Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that encodes high-dimensional data into lower-dimensional data while retaining the most important information. In the code section below we perform PCA over a range of component counts, calculating \(R^2\) at each step and keeping the best result. In comparison to the clustering algorithms we will see later in this section, this code is relatively simple. We save the best result as PCA-{iteration}-{model}. In this section we use LinearRegression (LR), AdaBoostRegressor (AB) and XGBRegressor (XG) as our supervised regression models.
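As a standalone illustration on synthetic data (separate from the search class that follows), PCA compresses strongly correlated columns into a few components while keeping nearly all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Three features where two are near-linear functions of the first
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2 * base + 0.01 * rng.normal(size=(200, 1)),
    -base + 0.01 * rng.normal(size=(200, 1)),
])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                          # (200, 2)
print(pca.explained_variance_ratio_[0] > 0.99)  # first component carries almost all variance
```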
import xgboost as xgb
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score
class PCAMaximizer:
def __init__(self, x_train, y_train, x_test, y_test):
"""
Initializes the PCAMaximizer class with training and testing data.
Args:
x_train: Training data features.
y_train: Training data labels.
x_test: Testing data features.
y_test: Testing data labels.
"""
self.x_train = x_train
self.y_train = y_train
self.x_test = x_test
self.y_test = y_test
self.best_rsquared = 0
self.best_y_pred = None
self.best_i = None
self.best_model = None
def find_best_model(self):
"""
Finds the best model using PCA and different regression algorithms.
Returns:
The best model name.
"""
for i in range(2, 30):
# Perform PCA on training and testing data
pca = PCA(n_components=i)
x_train_reduced = pca.fit_transform(self.x_train)
x_test_reduced = pca.transform(self.x_test)
# Linear Regression
linear_regression = LinearRegression()
linear_regression.fit(x_train_reduced, self.y_train)
y_pred = linear_regression.predict(x_test_reduced)
calc_r2_score = r2_score(self.y_test, y_pred)  # r2_score expects (y_true, y_pred)
if calc_r2_score > self.best_rsquared:
self.best_rsquared = calc_r2_score
self.best_y_pred = y_pred
self.best_i = i
self.best_model = "LR"
# AdaBoost
adaboost_model = AdaBoostRegressor(random_state=42, n_estimators=100).fit(
x_train_reduced, self.y_train
)
y_pred = adaboost_model.predict(x_test_reduced)
calc_r2_score = r2_score(self.y_test, y_pred)
if calc_r2_score > self.best_rsquared:
self.best_rsquared = calc_r2_score
self.best_y_pred = y_pred
self.best_i = i
self.best_model = "AB"
# XGBoost
booster = "gbtree"
xgb_model = xgb.XGBRegressor(n_jobs=1, booster=booster).fit(x_train_reduced, self.y_train)
y_pred = xgb_model.predict(x_test_reduced)
calc_r2_score = r2_score(self.y_test, y_pred)
if calc_r2_score > self.best_rsquared:
self.best_rsquared = calc_r2_score
self.best_y_pred = y_pred
self.best_i = i
self.best_model = "XG"
return self.best_model
def get_best_y_pred(self):
"""
Returns the best predicted values.
Returns:
The best predicted values.
"""
return self.best_y_pred
start = time.time()
pca_maximizer = PCAMaximizer(x_train, y_train, x_test, y_test)
pca_maximizer.find_best_model()
# calc_model_stats(f"PCA {best_i}-{best_model}", y_test, best_y_pred)
model_metrics.calculate(f"PCA {pca_maximizer.best_i}-{pca_maximizer.best_model}", y_test, pca_maximizer.best_y_pred, start)
# execution_times[f"PCA {pca_maximizer.best_i}-{pca_maximizer.best_model}"] = {"duration": time.time() - start}
6.3 Feature Selection
To initiate the cluster modeling process, we start by generating an optimized list of features and saving it in the selected_features_list variable. Each entry holds the top-performing feature subset of a given size: the first item is the best-performing pair of features, the second item the best-performing set of three, and so on. We accomplish this with the scikit-learn SelectKBest module. Furthermore, we fit a linear regression on each subset and store the best outcome under the SelectKBestLR label.
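A minimal standalone sketch of how SelectKBest with f_regression picks informative columns (synthetic data; the column indices here are illustrative only):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
# The target depends only on columns 0 and 2
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=100)

selector = SelectKBest(score_func=f_regression, k=2)
X_new = selector.fit_transform(X, y)

print(X_new.shape)                                  # (100, 2)
print(selector.get_support(indices=True).tolist())  # [0, 2]
```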
class FeatureSelection:
"""
Use `SelectKBest` to predict on the testing labels
"""
def __init__(self, x_train, y_train, x_test, y_test):
"""
Initialize the class
:param x_train: training data
:param y_train: training labels
:param x_test: testing data
:param y_test: testing labels
"""
self.x_train = x_train
self.y_train = y_train
self.x_test = x_test
self.y_test = y_test
self.best_features = None
self.selected_features = None
self.best_y_pred = None
def select(self):
    """
    Select the best features
    :return: None
    """
best_rsquared = 0
best_features = None
# Select the best features
best_y_pred = None
selected_features_list = []
# Get the selected features
for i in range(2, len(self.x_train.columns) - 1):
selector = SelectKBest(score_func=f_regression, k=i)
X_reduced = selector.fit_transform(self.x_train, self.y_train)
selected_features = pd.DataFrame(
selector.inverse_transform(X_reduced),
index=self.x_train.index,
columns=self.x_train.columns,
)
# Keep only the columns SelectKBest retained (non-selected columns are all zeros)
selected_columns = selected_features.columns[selected_features.var() != 0]
selected_features_list.append(list(selected_columns))
# Fit a linear regression on the selected columns
linear_regression = LinearRegression()
linear_regression.fit(self.x_train[selected_columns], self.y_train)
# Predict the test labels; r2_score expects (y_true, y_pred)
y_pred = linear_regression.predict(self.x_test[selected_columns])
calc_r2_score = r2_score(self.y_test, y_pred)
# Keep the best-scoring feature subset and its predictions
if calc_r2_score > best_rsquared:
best_rsquared = calc_r2_score
best_features = selected_columns
best_y_pred = y_pred
self.best_features = best_features
self.best_y_pred = best_y_pred
self.selected_features = selected_features_list
start = time.time()
feature_selector = FeatureSelection(x_train, y_train, x_test, y_test)
feature_selector.select()
selected_features_list = feature_selector.selected_features
# calc_model_stats("SelectKBestLR", y_test, best_y_pred)
model_metrics.calculate(f"SelectKBestLR", y_test, feature_selector.best_y_pred, start)
execution_times["SelectKBestLR"] = {"duration": time.time() - start}
print(feature_selector.selected_features)
[['taxAssessedValue', 'taxableLandValue'], ['taxAssessedValue', 'taxableLandValue', 'taxableImprovementValue'], ['taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue'], ['livingArea', 'taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue'], ['baths', 'livingArea', 'taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue'], ..., ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearBuilt', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Baseboard Electric', 'heatingTypeCode_Baseboard Electric/Hot Water', 'heatingTypeCode_Baseboard Hot Water', 'heatingTypeCode_Floor/wall Furnace', 'heatingTypeCode_Forced Air', 'heatingTypeCode_Forced Air Gas', 'heatingTypeCode_Furnace', 'heatingTypeCode_Gravity', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'heatingTypeCode_Wall Furnace', 'heatingTypeCode_Warm Air', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'buildingQualityCode_Poor', 'garageCode_Basement', 'garageCode_Built Under', 'garageCode_Carport', 'garageCode_Detached Frame', 'garageCode_Detached Masonry']]
6.4 Cluster Selection
This section focuses on finding the highest-scoring feature columns for a given number of clusters, using silhouette_score as the metric for evaluating cluster quality. The ClusterFeatureSelector class generates combinations of column names, iterates over these combinations and the number of clusters, and calculates the silhouette_score for each. We save the best cluster score and feature columns in best_cluster_features and return them from build_by_dimension. It is recommended to keep the dimension argument low (2 or 3) to avoid long run times.
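To make the metric concrete before the search below, here is a small sketch (synthetic blobs, not our listings) of scoring candidate cluster counts with silhouette_score:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs, so k=3 should score highest
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1]; higher means tighter, better-separated clusters

best_k = max(scores, key=scores.get)
print(best_k)  # 3
```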
MIN_CLUSTERS = 2
MAX_CLUSTERS = 5
class ClusterFeatureSelector:
def __init__(
self,
input_df,
input_cols,
model,
min_clusters=MIN_CLUSTERS,
max_clusters=MAX_CLUSTERS,
):
cluster_cols = input_cols.copy()
cluster_cols.remove("price")
cluster_cols.remove("zestimate")
cluster_cols.remove("beds")
cluster_cols.remove("baths")
self.cluster_features = cluster_cols
self.input_df = input_df
self.model = model
self.min_clusters = min_clusters
self.max_clusters = max_clusters
def build_by_dimension(self, dimension):
best_cluster_features = {}
feature_combinations = list(
itertools.combinations(self.cluster_features, dimension)
)
for feature_combo in feature_combinations:
input_df = self.input_df.copy()
cluster_input_list = list([input_df[i].to_numpy() for i in feature_combo])
cluster_input = np.dstack(cluster_input_list)[0]
for i in range(self.min_clusters, self.max_clusters + 1):
cluster_model = copy.copy(self.model)
cluster_model.n_clusters = i
cluster_model.fit(cluster_input)
this_score = silhouette_score(cluster_input, cluster_model.labels_)
if i not in best_cluster_features:
best_cluster_features[i] = {"sil_score": 0, "features": None}
if this_score > best_cluster_features[i]["sil_score"]:
best_cluster_features[i]["sil_score"] = this_score
best_cluster_features[i]["features"] = feature_combo
return best_cluster_features
start = time.time()
clusterFeatureSelector = ClusterFeatureSelector(df, float_cols, KMeans(init="k-means++", n_init=20, random_state=42))
two_d_cluster_features = clusterFeatureSelector.build_by_dimension(2)
execution_times["TwoDClusterSelection"] = {"duration": time.time() - start}
# three_d_cluster_features = clusterFeatureSelector.build_by_dimension(3)
# print(three_d_cluster_features)
6.5 Regression Using Clusters
The ClusterRegression class breaks the dataset up into smaller clusters and runs a regression algorithm on each cluster. In theory this is a simple concept, but it introduces many variables in the form of feature selection, cluster selection, and the number of clusters. To manage this, we precalculated the best features and clusters in the previous two sections and use them here to find the optimal cluster.
class ClusterRegression:
def __init__(
self,
train,
test,
target,
cluster_model,
regression_model,
min_clusters=MIN_CLUSTERS,
max_clusters=MAX_CLUSTERS,
):
self.train = train
self.test = test
self.target = target
self.cluster_model = cluster_model
self.regression_model = regression_model
self.min_clusters = min_clusters
self.max_clusters = max_clusters
self.best_rsquared = 0
self.best_regression_features = None
self.best_cluster_features = None
self.best_n_clusters = None
self.best_pred = None
def find_optimal_cluster(self, regression_features, cluster_features):
train_df = self.train.copy()
test_df = self.test.copy()
for regression_feat_cols in regression_features:
for cluster_num in cluster_features.keys():
this_regression_features = regression_feat_cols
this_cluster_features = cluster_features[cluster_num]["features"]
n_clusters = cluster_num
train_cluster_input_list = list([train_df[i].to_numpy() for i in this_cluster_features])
test_cluster_input_list = list([test_df[i].to_numpy() for i in this_cluster_features])
train_cluster_input = np.dstack(train_cluster_input_list)[0]
test_cluster_input = np.dstack(test_cluster_input_list)[0]
this_train_df = train_df.copy()
this_test_df = test_df.copy()
cluster_model = copy.copy(self.cluster_model)
cluster_model.n_clusters = cluster_num
cluster_model.fit(train_cluster_input)
this_train_df["cluster_label"] = cluster_model.labels_
train_labels = cluster_model.labels_
test_labels = cluster_model.predict(test_cluster_input)  # predict (not refit) so test points map to the clusters learned from training data
this_test_df["cluster_label"] = test_labels
# Build test and train dataframes for each cluster
train_clusters = {label: pd.DataFrame() for label in train_labels}
for key in train_clusters.keys():
train_clusters[key] = this_train_df[:][this_train_df['cluster_label'] == key]
test_clusters = {label: pd.DataFrame() for label in test_labels}
for key in test_clusters.keys():
test_clusters[key] = this_test_df[:][this_test_df['cluster_label'] == key]
test_cluster_df_list = []
for key in train_clusters.keys():
train_cluster_df = train_clusters[key]
test_cluster_df = test_clusters[key]
regression_model = copy.copy(self.regression_model)
regression_model.fit(train_cluster_df[this_regression_features], train_cluster_df[self.target])  # fit the copy, not the shared model
cluster_y_pred = regression_model.predict(test_cluster_df[this_regression_features])
test_cluster_df["y_pred"] = cluster_y_pred
test_cluster_df_list.append(test_cluster_df)
pred_df = pd.concat(test_cluster_df_list)
pred_df = pred_df.sort_index()
test_pred = y_test.sort_index()
rsquared = r2_score(test_pred, pred_df['y_pred'])
if rsquared > self.best_rsquared:
self.best_rsquared = rsquared
self.best_regression_features = this_regression_features
self.best_cluster_features = this_cluster_features
self.best_n_clusters = cluster_num
self.best_pred = pred_df['y_pred']
print(f"Best:\n\tR2: {self.best_rsquared}\n\tRegression: {self.best_regression_features}\n\tCluster: {self.best_cluster_features}\n\tn_clusters: {self.best_n_clusters}")
print()
def get_best_rsquared(self):
return self.best_rsquared
def get_best_pred(self):
return self.best_pred
def get_best_features(self):
return self.best_regression_features
def get_best_clusters(self):
return self.best_cluster_features
start = time.time()
linear_cluster_regressor = ClusterRegression(all_train, all_test, target, KMeans(init="k-means++", n_init="auto", random_state=42), LinearRegression())
linear_cluster_regressor.find_optimal_cluster(selected_features_list, two_d_cluster_features)
y_pred_lcr = linear_cluster_regressor.get_best_pred()
# calc_model_stats("OptClusterLinReg", y_test.sort_index(), y_pred_lcr)
model_metrics.calculate("OptClusterLinReg", y_test.sort_index(), y_pred_lcr, start)
execution_times["OptClusterLinReg"] = {"duration": time.time() - start}
Best:
R2: 0.6979821632696949
Regression: ['beds', 'baths', 'latitude', 'longitude', 'livingArea', 'taxAssessedValue', 'schoolsRating', 'yearRenovated', 'totalSqFt', 'lotSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Hot Water', 'heatingTypeCode_Radiant', 'buildingQualityCode_Average', 'buildingQualityCode_Excellent', 'buildingQualityCode_Fair', 'buildingQualityCode_Good', 'garageCode_Basement']
Cluster: ('longitude', 'lotSqFt')
n_clusters: 3
start = time.time()
xgb_cluster_regressor = ClusterRegression(all_train, all_test, target, KMeans(init="k-means++", n_init="auto", random_state=42), xgb.XGBRegressor(n_jobs=1, booster="gbtree"))
xgb_cluster_regressor.find_optimal_cluster(selected_features_list, two_d_cluster_features)
y_pred_xgbcr = xgb_cluster_regressor.get_best_pred()
# calc_model_stats("OptClusterXGB", y_test.sort_index(), y_pred_xgbcr)
model_metrics.calculate("OptClusterXGB", y_test.sort_index(), y_pred_xgbcr, start)
execution_times["OptClusterXGB"] = {"duration": time.time() - start}
Best:
R2: 0.7275406553596411
Regression: ['baths', 'longitude', 'livingArea', 'taxAssessedValue', 'totalSqFt', 'taxableLandValue', 'taxableImprovementValue', 'groundFloorSquareFeet', 'garageParkingSquareFeet', 'heatingTypeCode_Radiant', 'buildingQualityCode_Excellent']
Cluster: ('longitude', 'taxableLandValue')
n_clusters: 2
6.6 Neural Network
For a complete comparison of modeling techniques, we build a naive neural network model to see if it offers additional accuracy compared to the supervised and unsupervised ML algorithms. This is by no means an optimized or tuned neural network.
start = time.time()
input_shape = x_train.shape[1]
# Scale the input data
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
# Define the neural network architecture
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(input_shape,)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
# Calculate decay steps based on the number of epochs and batch size
decay_steps = len(x_train) // 32 * 50
# Define the learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=decay_steps,
    decay_rate=0.9
)
optimizer = tf.keras.optimizers.Adam()
# Compile the model
model.compile(optimizer=optimizer, loss='mean_squared_error')
# Train the model with early stopping and reduced learning rate
early_stopping = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(factor=0.1, patience=3)
model.fit(x_train_scaled, y_train, epochs=50, batch_size=32, verbose="auto",
          validation_data=(x_test_scaled, y_test),
          callbacks=[early_stopping, reduce_lr, tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
# Make predictions on x_test
y_pred = model.predict(x_test_scaled)
model_metrics.calculate("NeuralNetwork", y_test, y_pred, start)
execution_times["NeuralNetwork"] = {"duration": time.time() - start}
Epoch 1/50 - loss: 1976414765056.0000 - val_loss: 3059149176832.0000 - lr: 0.0100
Epoch 2/50 - loss: 1837928284160.0000 - val_loss: 2132133937152.0000 - lr: 0.0100
Epoch 3/50 - loss: 2378015965184.0000 - val_loss: 2000834265088.0000 - lr: 0.0100
Epoch 4/50 - loss: 1898876239872.0000 - val_loss: 2856299528192.0000 - lr: 0.0100
Epoch 5/50 - loss: 3904141524992.0000 - val_loss: 22992841080832.0000 - lr: 0.0100
Epoch 6/50 - loss: 9197906296832.0000 - val_loss: 10479675113472.0000 - lr: 9.9934e-04
Epoch 7/50 - loss: 8856421793792.0000 - val_loss: 2820969857024.0000 - lr: 0.0100
Epoch 8/50 - loss: 4270040547328.0000 - val_loss: 14229082996736.0000 - lr: 0.0100
5/5 [==============================] - 0s 2ms/step
7 Results & Analysis
7.1 Model Accuracy Results
Figure 12 compares the results of the model accuracy computations in a bar plot:
model_metrics.plot_all()
7.2 Model Accuracy Analysis
In Figure 12 we compare the \(R^2\) and root mean squared error (RMSE) evaluation metrics for all models. These metrics are well suited to quantifying error in regression algorithms, and are used as such in this project. Surprisingly, our clustering regression model (OptClusterLinReg) performed the best of all models. The next best performing model was PCA-2-XG, which performed principal component analysis and then ran XGBRegressor on the result. Our neural network and linear regression models performed similarly.
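As a reference for how these two metrics behave, the sketch below computes \(R^2\) and RMSE with scikit-learn on a few hypothetical sale prices (illustrative values, not drawn from the project dataset):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical actual and predicted sale prices (illustrative only)
y_true = np.array([450_000, 520_000, 610_000, 380_000], dtype=float)
y_pred = np.array([460_000, 500_000, 650_000, 400_000], dtype=float)

# R^2: fraction of price variance explained; RMSE: typical error in dollars
r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"R^2: {r2:.3f}, RMSE: {rmse:,.0f}")
```

RMSE is in the same units as price, which makes it easier to communicate than \(R^2\), while \(R^2\) allows comparison across differently scaled targets.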
7.3 Execution Time Results
exc_times_df = pd.DataFrame(execution_times).T
exc_times_df = exc_times_df.sort_values("duration", ascending=False)
exc_times_df["model"] = exc_times_df.index
fig, ax = plt.subplots(figsize=(12, 8))
sns.barplot(data=exc_times_df, x="model", y="duration", ax=ax, color="steelblue")
plt.title("Model Execution Time Comparison")
plt.ylabel("Duration (seconds)")
plt.xlabel("Model")
plt.xticks(rotation=45)
brand_plot()
plt.show()
7.4 Execution Time Analysis
In Figure 13 we compare the execution times of all ML models. Here we see that our clustering regression models OptClusterXGB and OptClusterLinReg execute much more slowly than the other models. While they are more accurate, this accuracy comes at a cost: one has to weigh execution time against accuracy. In certain environments it may be better to sacrifice accuracy for speed.
8 Conclusion
In this project we sought to test and validate methods for using the output of unsupervised ML methods as input into supervised ML methods to perform price prediction on real world real estate listings.
As detailed in Section 3 we started by scraping real estate listings from Zillow and supplementing them with data from Redfin. In total we scraped 718 listings with 161 columns. In Section 4 the data was cleaned to remove unnecessary and duplicate features. Categorical data was transformed using one hot encoding. The data was then split into training and test data sets. Section 5 visualized feature importance and took a deeper look at the most important features.
In Section 6 we began with simple linear regression and progressed to using our planned unsupervised to supervised modeling approach. We used PCA to reduce the dimensionality of the features, and clustering algorithms to create organized subsets of the data. These subsets were passed to supervised regression algorithms. For a final comparison we added 3 neural network implementations.
Section 7 visualizes and analyzes the model accuracy and execution time results. The best performing models combine clustering and regression, with KMeans plus LinearRegression performing best. While accurate, these clustering and regression models are also slow to execute, especially when compared to basic linear regression.
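The cluster-then-regress pattern at the heart of these models can be sketched as follows on synthetic data; the project's OptClusterLinReg additionally searches over feature columns, cluster columns, and cluster counts, which this minimal sketch omits:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for listing features and prices (not project data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 100_000 + 50_000 * X[:, 0] + rng.normal(scale=5_000, size=200)

# Step 1: partition the data with KMeans
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Step 2: fit one regressor per cluster on that cluster's rows
models = {c: LinearRegression().fit(X[labels == c], y[labels == c])
          for c in np.unique(labels)}

# Step 3: route each point to its cluster's regressor for prediction
preds = np.empty(len(y))
for c, m in models.items():
    mask = labels == c
    preds[mask] = m.predict(X[mask])
```

New listings would first be assigned a cluster via `km.predict`, then scored by that cluster's regressor; careful index tracking is required so predictions line up with the original rows, which matches the lesson noted below about tracking data through combined models.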
Through this project we learned:
- Cleaning techniques for real world real estate listing data
- One Hot Encoding (OHE) using pandas.get_dummies
- Integration of unsupervised and supervised methods into combined models that predict prices
- Combining clusters and regression requires careful tracking of the data
- Feature selection with SelectKBest provided a relatively large increase in \(R^2\) for a small increase in execution time:
  - May be the ideal solution for environments that can sacrifice accuracy for speed
- Principal Component Analysis offers a simple API for reducing the complexity of features
- Optimization of feature selection becomes increasingly important with the number of inputs:
  - ClusterRegression can iterate over feature columns, cluster columns, and number of clusters
  - Speed optimization required using optimized algorithms for feature selection
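The SelectKBest lesson above can be illustrated with a minimal sketch on synthetic data (the feature names and sizes are illustrative, not from the project dataset):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data where only column 2 truly drives the target
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 2] + rng.normal(scale=0.1, size=100)

# Keep the k features with the strongest univariate relationship to y
selector = SelectKBest(score_func=f_regression, k=3)
X_sel = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # column indices that were kept
```

Because the univariate F-test is cheap relative to wrapper-style searches, this is where the "large \(R^2\) gain for a small execution time cost" tradeoff comes from.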
Areas of Exploration:
- Tuning hyperparameters of clustering and regression models
- Testing different regression and clustering models
- Speed optimization
- Dataset tuning:
  - Removal of OHE features
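The OHE features mentioned above come from pandas.get_dummies; a minimal sketch of how such columns are created (the category values here are illustrative, though the `heatingTypeCode_` prefix matches the column style seen earlier in this report):

```python
import pandas as pd

# Hypothetical categorical listing column (values are illustrative)
df = pd.DataFrame({"heatingTypeCode": ["Radiant", "ForcedAir", "Radiant"]})

# One-hot encode: the original column is replaced by one 0/1 column per category
ohe = pd.get_dummies(df, columns=["heatingTypeCode"])
print(list(ohe.columns))
```

Removing or consolidating these columns during dataset tuning would shrink the feature space that the clustering and feature selection steps must search.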
Overall, this project provided a solid learning platform to iterate on unsupervised and supervised machine learning implementations. Throughout this project we were forced to confront the realities of using machine learning. Some implementations were slow, some were too complex, and some provided poor accuracy. In the end we were able to follow through on our goal of creating an accurate price prediction algorithm.
9 Python Environment
Below are the versions of python and included libraries used for this project:
Code
import sys
print("Python Version:", sys.version)
# Print module versions: https://stackoverflow.com/a/49199019
import pkg_resources
import types
def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            name = val.__name__.split(".")[0]
        elif isinstance(val, type):
            name = val.__module__.split(".")[0]
        poorly_named_packages = {
            "PIL": "Pillow",
            "sklearn": "scikit-learn"
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]
        yield name
imports = list(set(get_imports()))
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name != "pip":
        requirements.append((m.project_name, m.version))
for r in requirements:
    print("{}=={}".format(*r))
Python Version: 3.11.4 (main, Jul 25 2023, 17:36:13) [Clang 14.0.3 (clang-1403.0.22.14.1)]
matplotlib==3.7.2
numpy==1.24.3
pandas==2.1.1
scikit-learn==1.3.0
seaborn==0.13.0
xgboost==2.0.3